Enhancing global preparedness during an ongoing pandemic from partial and noisy data

Abstract As the coronavirus disease 2019 spread globally, emerging variants such as B.1.1.529 quickly became dominant worldwide. Sustained community transmission favors the proliferation of mutated sub-lineages with pandemic potential, due to cross-national mobility flows, which are responsible for consecutive cases surge worldwide. We show that, in the early stages of an emerging variant, integrating data from national genomic surveillance and global human mobility with large-scale epidemic modeling allows to quantify its pandemic potential, providing quantifiable indicators for pro-active policy interventions. We validate our framework on worldwide spreading variants and gain insights about the pandemic potential of BA.5, BA.2.75, and other sub- and lineages. We combine the different sources of information in a simple estimate of the pandemic delay and show that only in combination, the pandemic potentials of the lineages are correctly assessed relative to each other. Compared to a country-level epidemic intelligence, our scalable integrated approach, that is pandemic intelligence, permits to enhance global preparedness to contrast the pandemic of respiratory pathogens such as SARS-CoV-2.


I
Ph ylogeneticReconstruction S2 I. 1 Genomic dataset compilation S2 I. 2 Phylogenetic estimates of epidemiological parameters S2 I. 3 Estimates based on epidemic modeling S2 II Import Risk estimation S5 II. 1 International travel dataset compilation S5 II. 2 Quantifying the Import Risk S6 II. 3 Relation to distance and arrival time S7 II. 4 Alternative distance measures S8 II. 5 Data for arrival time and outbreak region S9 II. 6 Outbreak detection based on 1st count GISAID data S9 III Epidemic Scenarios S12 III. 1 Renewal equation S12 III. 2 A fully worked out example: the Alpha variant S13 III. 3 Prediction error S14 IV Pandemic delay S17 V Information Distance S19 -time of most recent common ancestor -outbreak country = -exp. growth rate -molecular rate of evo.
-outflow from source -cntr. specific import probability Worldwide epidemic data (OWID)  S1. Schematic mechanistic pipeline workflow. The three pieces of our pipeline (black dashed boxes) are illustrated and what input they get from external data sources (orange colored) or from the output of earlier parts of the pipeline (blue arrows). The t0 close to the orange arrows means that data from the external sources is used at or prior to t0, which is the estimated time of the most recent ancestor (the output of the first part of the pipeline, the phylogenetic reconstruction).

Genomic dataset compilation
We retrieved all SARS-CoV-2 sequences belonging to the Alpha B.1.1.7, Delta B.1.617.2, Omicron B.1.1.529 (BA.1), BA.2, BA.5, and BA.2.75 lineages from GISAID. Each genomic dataset was filtered by only retaining those sequences that were generated from cases reported during the initial wave and from the country of evolutionary origin, up to a total of 100 sequences per lineage. We then generated 3 alignments using MAFFT 7.505 [1], each comprised of 20%, 50% and 100% of the total number of sequences, which were subsequently cleaned by trimming the 5 0 and 3 0 untranslated regions and gap-only sites.

Phylogenetic estimates of epidemiological parameters
We performed a common Bayesian evolutionary reconstruction of timed phylogenetic history using BEAST 1.10.5 [2] that was source compiled from its GitHub repository (https://github.com/beast-dev/beast-mcmc). We modelled the nucleotide substitution process according to a HKY 85+ parameterisation, setting a strict molecular clock and an exponential growth model as coalescent prior. We used a Lognormal(µ = 9 ⇥ 10 4 , 2 = 1 ⇥ 10 5 ) prior for the molecular rate of evolution, a Laplace(µ = 0, b = 100) prior for the rate of exponential growth and a Lognormal(µ = 5.7, 2 = 2.3) prior for the exponentially growing viral population size. We further set an initial calibration for the time of the most recent common ancestor (tMRCA) at an age of ⇠ 6 months before the most recent sample included in the alignment. All the remaining priors were left at their default values.
Bayesian inference through Markov chain Monte Carlo (MCMC) was performed for 2 ⇥ 10 8 generations, sampling every 20,000 generations and using the BEAGLE 3.1.2 library to increase computational performance [3]. MCMC convergence and mixing properties were inspected using Tracer 1.7.2 [4] to ensure that e↵ective sample size (ESS) values associated with estimated parameters were all >200. After discarding 10% of sampled trees as burn-in, estimates of the growth rate, molecular clock and tMRCA were extracted along with their posterior distributions ( Figure S2).

Estimates based on epidemic modeling
We obtain an independent estimate for t 0 , the time of the first unreported case, and for other epidemic parameters, such as the e↵ective reproduction number and the generation interval. By indicating with I(t) the number of infected individuals at time t and with D(t) the number of deaths, we consider the stage with the co-circulation of an existing variant v and the emerging one !. Since we consider the final stage of the contagions due to v and the early stage of the contagions due to !, we approximate the Enhancing global preparedness from partial data where I x (t) is the number of infections due to variant x at time t, R x (t 0 ) is the e↵ective reproduction number at time t 0 and GI x is the generation interval. Similarly, the deaths due to the co-circulating variants are approximated by where IFR x denotes the infection fatality rate of variant x and ⌧ x is the lag between infection and death. To fit the unknown parameters, i.e. the ones related to variant !, we use particle swarm optimization [5] to minimize the loss function where I obs (t) and D obs (t) are the number of infected individuals and deaths from empirical data [6], Var indicates the variance in time and ✓ = {t 0 ; R ! (t 0 ); GI ! ; IFR ! ; ⌧ ! } is the vector of the epidemiological parameters characterizing the emerging variant, for which we obtain a joint probability distribution.

Import Risk estimation
International travel dataset compilation We retrieve the monthly seat capacities between airports from the OAG (O cial Airline Guide). Note, that it does not represent the actual passengers that flew from airport A to B in one month, but the maximal capacity, i.e. how many could have travelled if all seats were occupied. It is therefore an upper limit for the passenger flux and we refer to it as the flow matrix F, where F ij describes the maximal passenger flow to i from j. We estimate the travelling population in the catchment area of an airport by we assume that the population is proportional to the outflux of the airport. For each variant, we use the world air-transportation network (WAN) at the month of the outbreak day of the respective variant.   log(p1(m|n0)) the WAN of the WHO outbreak month is used and the WHO outbreak location as source country. The arrival times are taken from the "cov-lineages.org" [7,8] project. . Arrival prediction (r-value) for the 10 best outbreak candidate. The r-value between the import risk distance D IR (m|n0) = log(p1(m|n0)) and the arrival time for the 10 best ranked outbreak countries (n0). The 2 Letters in the circles are the countries ISO alpha-2 codes. The red circle marks the country declared as outbreak country by the WHO.

Quantifying the Import Risk
The import risk method is introduced in a separate study [9] where it is compared to another data-driven estimate. Here we present a short outline of the method. To know how many passengers leave at node j given they started at node i, we introduce the shortest path exit probability q(j|i) (SPEx). It is based on the shortest path tree of the e↵ective distance [10], and combines the exit probability with all possible paths that end in j. The resulting import risk is therefore an extension of the SPEx.
In order to compute the SPEx we first define, with the flow matrix (maximal passenger flux) F and the travelling population of the catchment area N i , the transition matrix P, where the element P ij = F ij / P i F ij = F ij /F j is the probability to transition to i from j. Now, the e↵ective distance graph [10] is D ij = d 0 log(P ij ), with d 0 as the distance o↵set which we set to d 0 = 1 (the larger d 0 the more D ij increases with increasing hop-distance). Let T(n 0 ) be the shortest path tree on D for the point of origin n 0 . With respect to node n the downstream nodes ⌦(n|n 0 ) are those nodes that can be reached from the source n 0 through node n on T(n 0 ). Now we compute the SPEx q(i|n 0 ) by assuming that all passenger that start at n 0 , travel along the shortest path tree T(n 0 ) and distribute to other airports according to their respective populations N n . We assume that the exit probability at i is proportional to the ratio of the population at i (i.e. N i ) to the population of all of i's downstream nodes P n2⌦(i|n 0 ) N n plus N i : Now, we use the SPEx on a random walk that starts at n 0 and the walker exits at node i with probability q(i|n 0 ) or continues its walk with probability 1 q(i|n 0 ). Thus, the probability to be at node m if the walker was before at node m 1 is S(m, m 1|n 0 ) = P m,m 1 (1 q(m 1|n 0 )) . (S7) Consequently, the probability to take a path starting at n 0 and exiting at m is The probability to exit at node m from all possible paths (of all possible lenghts) is (S10) Note that S k (n 0 ) m,n 0 is the probability sum of all paths that started in n 0 and end after k steps in m. We aggregate all airports of the same country by computing the weighted mean with weights with C(n) as the set of airports that belong the same country as node n does.

Relation to distance and arrival time
In order to assess the quality of the import risk, we compare it with the arrival time of past variants. Clearly, the higher the import risk to a country, the earlier it is to arrive and the direct relation between the probability of travel to a city m from a city n 0 and the mean first arrival time t 1 is which is the e↵ective distance [10,11]. Thus, we define the import risk distance as which is proportional to the mean first arrival time.

Alternative distance measures
There are alternative measures to estimate the arrival time [10,12,13], and we want to compare our import risk distance to these established measures. However, please note that the alternative measures have a clear qualitative relation to the arrival time, but it is not possible to directly infer the number of passengers that travel between airports from them (what the import risk is especially designed for). The already introduced alternative measure is the e↵ective distance [10] that uses the flow between airports to estimate the probability to travel from airport n to m Now, the distance along a specific path that connects m and n 0 is the sum of the path elements distances Finally the e↵ective distance from airport n 0 to m, also not directly connected airports, is the minimal e↵ective distance of all possible paths ⌦(m, n 0 ) they are connected through An extension to the e↵ective distance is the random-walk e↵ective distance [13] that considers all possible paths connecting two airports ⌦(m, n 0 ) instead of only taking the dominant path with the shortest distance: Note that the sum of path distances via their exponential is due to the linkage to the arrival time as explained in [13]. We also add a comparison with a metric derived from Di↵usion Distance [12] which exploits the definition of a random walk Laplacian on top of the WAN. We further explain this Information Distance D ID in the dedicated section V.
Country-Level aggregation. The country-level aggregation of the import risk distance D IR is done by first aggregating the import risk on country-level (as described in Sect. II.2) and then applying Eq. S13.
To aggregate the other distances (D ef f , D RW ) we could either take (along the line of D ef f ) the minimal distance between two countries (of all relevant airport pairs), or use a weighted multipath approach as used in the derivation of D RW . We will highlight the latter in the following; however, we also computed the minimal measure and found that it is outperformed by the multipath distance (not shown, but it is the basic finding in [13]).
As shown in [11], the e↵ective distance of two paths combined is . (S18) Thus, the multipath (MP) e↵ective distance that considers all shortest paths from country S to M is: with M as the set of all target airports in country M and S all source airports of country S.
Since the distance of source airports with a larger population are more important, we additionally weight the source airport with w i = F i / P s2s F s , which represents the probability of an infected to start in location n. Now, we compute the population weighted multipath e↵ective distance by Note that the weighting for the e↵ective distance can be reformulated to which corresponds to multiplying the probability to start at the source airport s to the first step of each path. Analogously the

Data for arrival time and outbreak region
We compare the import risk to measured arrival times of di↵erent variants. Therefore, we need to define the outbreak-country and -month and the arrival times. We defined these variables in di↵erent ways.
(I) external sources Here we rely on peer reviewed [8] or o cial [14] sources. The outbreak country and the outbreak month are taken from the website of the World Health Organization (WHO) "Tracking SARS-CoV-2 variants" [14] and the arrival times of the variants Alpha, Beta, Delta, Gamma and Omicron were externally computed with "grinch" [8] and taken from their project website [7]. If arrival times are before the o cial outbreak they are removed from the analysis (for Delta=1, Gamma=1 and Omicron=19 countries are removed).
(II) GISAID data To also use the other variants to validate our import risk method we design a simple arrival time algorithm. First, we need to define the outbreak day. Instead of relying on an o cial definition from the WHO, we use GISAID data. The outbreak time T X,out of variant X is defined by with F X (t) being the fraction of variant X to all sequenced probes at time t and T (F X (t) g · max(F X )) the time when F X (t) crosses the first time the threshold gcdot max(F X ) where g 2]0, 1[ and we set g = 0.025. In other words, the outbreak is defined by 30 days before the variant reached 2.5% of its worldwide peak. We estimate the arrival time of variant X in an country by the most simple way: the first time the variant is detected (according to GISAID data). In Fig. S8 the estimated outbreak time, o cial WHO and arrival times of each country are shown. Since for some variants (Alpha, Delta, BA.2) many arrival times fall clearly before our estimated and even the o cial outbreak date, we recomputed for these countries the arrival time to the first GISAID-detection after the outbreak date. We argue that either (i) the sequencing of the variant in these countries was error-prone (1. count is very sensitive to any wrong detection) or (ii) the spreading was slow and the variant did not dominate the local epidemic until it reached a susceptible country (low NPIs) from where it did spread more easily (probably the case for Delta).
Outbreak detection based on 1st count GISAID data If we repeat the outbreak detection method using all variants and the arrival times estimated via GISAID data (arrival by first detection, Fig. S8), we see that the outbreak detection via the best correlation between import risk distance D IR and arrival times T arrival in general confirms the outbreak regions declared by the WHO (see Figs. S10, S9). There is a discrepancy for Delta. While using WHO and "cov-lineages.org" data, the o cial outbreak country India (IN) was second best, it is only on rank 12 if our GISAID estimates are used. A possible explanation is, that our outbreak date estimation is 5 months after the WHO date. In order to not lose the countries with arrivals before the outbreak date, we recompute the arrivals by the first count after the estimated outbreak date. One can argue that Delta did locally spread much stronger in South Africa (ZA, the top ranked country), and therefore is ZA for the worldwide distribution of larger importance than India. An alternative explanation is that the passenger flow in the WAN was too low and when it increased, ZA had a more active Delta epidemic. . Arrival prediction (r-value) for the 10 best outbreak candidate. The r-value between the import risk distance d1(m|n0) = log(p1(m|n0)) and the arrival time for the 10 best ranked outbreak countries (n0). The 2 Letters in the circles are the countries ISO alpha-2 codes. The red circle marks the country estimated as outbreak country based on GISAID arrival times. In contrast to Fig. S6: the arrival times and outbreak dates are estimated via GISAID data (arrival by first count, outbreak date by reaching the first time 2.5% of worldwide peak of the respective variant).
Fig. S10. Arrival prediction performance (r-value) for the outbreak country candidates. The frequency of the r-value between the import risk distance D IR (m|n0) = log(p1(m|n0)) and the arrival time for all possible outbreak countries. The red vertical line marks the r-value using the country estimated as outbreak country based on GISAID arrival times. In contrast to Fig. S7: the arrival times and outbreak dates are estimated via GISAID data (arrival by first count, outbreak date by reaching the first time 2.5% of worldwide peak of the respective variant).

Epidemic Scenarios
We consider two distinct models to project the number of daily new infected people, namely, a renewal equation based model and a multi-strain SIR-like model. The first one is actually part of the pipeline, while the second one is used as validation.

Renewal equation
The renewal equation approach is a well-known technique, widely used in epidemiology [15,16,17]. The reason why renewal equations are such strong candidates for early projection of new cases, is the fact that informing them requires only the reproduction number of the new variant of concern, its generation interval distribution, and the number of people infected by the new variant who travel into the target country from the source country. This allows easily to explore scenarios with di↵erent values of epidemiological quantities of interest, such as the e↵ective reproduction number of a new variant as it spreads from the source country to others through travelers.
For now, we assume that the susceptible population is much larger than the number of active cases, and that the mixing between the infected and the susceptible is homogeneous. This allows to exclude feedback loops in the dynamics, e.g. the fact that immunity to the new variant builds up through infection, which would modify the dynamics itself. Such strong assumptions are acceptable as long as we restrict our projections to the very first few weeks from the introduction of the new variant in the target country.
The model assumes that the number of newly infected people at day t, I(t), is given by two distinct processes: a) the arrival of infected individuals from the source country (I out (t)) and b) the daily new infections (I in (t)) happening in the target country due to the endogenous spreading. The former is estimated from section II, while the latter can be estimated through the renewal equation where t 0 is the day the first infected cases arrived in the target country, R s is the daily reproductive number on day s, and s is the generation time distribution, i.e. the fraction of transmissions that would occur on day s after infection. Finally I(t) = I out (t) + I in (t). This is the simplest renewal process, which does not include the fact that the target population might have an inhomogeneous immunological landscape, due to previous infections or vaccination. To model this phenomenon, we reinterpret the term on the right side of equation (S23) as the number of inoculations spreading from currently infecting people, which will turn into infections depending on the susceptibility of the recipients. If we assume that previous infections (with other variants) protect against reinfection with an e cacy of n e , and, analogously, vaccination has an e↵ectiveness of ⌫ e , then we can explicitly account for removals by modifying equation (S23) into where R old (t) is the number of recovered people from previous variants that still have some protection against infections, and V (t) is the total number of vaccinated people. This assumes that the number of recovered or vaccinated people is uniformly distributed across the population, and that the events 'being vaccinated' and 'having been infected' are independent. This also assumes no gradual waning of protection against infection. However, we can consider as recovered or vaccinated only people who were infected or vaccinated recently, rather than from the beginning of the pandemic. For instance, considering only people who got either infected or their second dose up to six months prior to t is equivalent to assuming that there is an abrupt waning of e cacy against protection six months after getting infected or vaccinated. Although these hypotheses might seem unrealistic, the lack of readily available data about waning and immunological landscapes of various countries, and the fact that this should be used only for short-term scenario explorations, allow us to avoid introducing further complexity into the model.
The cumulative number of cases and amount of fully vaccinated individuals at each day are the ones reported in the public repository at 1 . We select the values for vaccine e cacy and protection from previous infection from available works. In particular we set the vaccine e cacy ⌫ e to 0 for Alpha, 0.5 for Delta, BA1 and BA2 and to 0.12 for BA.5 ( [18, 19, 20, 21]). The selected protection against reinfection n e is 1 for Delta, 0.56 for BA.1 and BA.2 Omicron lineages and 0.13 for BA.5 ( [22, 23, 21]).
The second model is a multi-strain SIR inspired by [24]. This is a two-strain model in which people who recover after being infected with the former variant are not completely immune to infection from the latter variant. The equations governing this system are where i (t) = i o . Note that, since I out (t) represents the arrivals from the source country at the beginning of each day, the system is not closed. This is not a problem because we are considering countries, so I out (t) N ⌧ 1. Since the dynamics does not include, per se, the fact that the initial condition changes every day due to arrivals, we can solve this system on a daily basis, updating the initial condition and restarting the system accordingly. The advantage of this system is that it includes feedback phenomena, which is good when validating the model, as it may need to run for more than a few weeks. The drawbacks are that informing the model requires good point estimates of the various compartments, and the interpretation of the transmissibility coe cient related to the measured R t , which may not be straight-forward. For such reasons, this model is used to validate the renewal equation approach, in particular for countries where no new cases were observed after a few weeks from their emergence (not shown). Projections errors valuated with the SIR model relative to Alpha lineage are shown in A fully worked out example: the Alpha variant We apply our pipeline to a real case, the Alpha variant of concern (VOC), that was identified in the UK on 20 September 2020 [8]. We assume that the UK is the source country and we demonstrate how the pipeline works. In the following, we consider as the generation time interval distribution the one inferred from the literature [25].
Starting from the phylogenetic part of our pipeline, we take the time of emergence estimated when n = 20 sequences were collected, to simulate a realistic scenario where only little information is available. This gives a central estimate for the time of emergence of the Alpha variant around the 9 th of November 2020. The daily growth rate estimated is r = 0.097 (95% HPD: 0.008-0.202). To translate this into R t in the source country, we assume that all the growth rate advantage of Alpha relative to the previous circulating variants is given only by transmission advantage (limited capacity of reinfections with Alpha). Further, typical generation time distributions are Gammas, as in [25]. This allows us to estimate the R t using formula 2.2 in [26]: where b and a are the shape and rate of the Gamma distribution generation time. In our case, a = 5.9, b = 1.13, therefore R t(↵) = 1.62(1.04, 2.63). For any target country, the projection of the number of cases infected with Alpha in the next weeks is performed in two steps: first, we estimate the number of infected travelers (referred to as seeds) who arrive in the target country from the source country, then we use the renewal equation (S24) on each possible scenario, to account for endogenous transmission of the secondary cases in the target country. The first step consists in using the import risk estimates described in section II to compute the number of daily travelers from source country to other target countries. We use import risk probability from source to target times the average daily outflow of passengers from source country using WAN data. We then determined the number of travelers infected with Alpha. This is done by considering the proportion of sequenced cases that are Alpha times the 7 day moving average of daily incidence of new cases, assuming that sequences are taken randomly from the infected population. This estimate does not include undercounting in the source country, which we can estimate as follows.
For a given country, we use the daily new estimated COVID-19 infections from the IHME model, which is a hybrid with two main components: a statistical "death model" component produces death estimates that are used to fit an SEIR model component 2 . For a complete overview of this model and a comparison with other estimates, we refer to OWID 3 . The data we used for our estimation are publicly available 4 . In a given temporal window, we integrate over time the confirmed number of cases (7d moving average) and the estimated true number of cases, as well as the estimates for its lower and upper bounds defining the 95% uncertainty interval. The mean undercounting factor is estimated by the ratio between the integrated estimate of the true cases and the confirmed ones in the temporal window, and similarly we estimate the corresponding uncertainty interval. We show in Fig. SS11 the undercounting factor obtained for all countries for which the data is available, whereas Fig. SS12 shows the evolution of this factor along periods of 6 months for some representative countries.
To allow for variability in undercounting, we consider two extreme scenarios: the best one, where undercounting is assumed to be 2.27, and the worst one, where undercounting is assumed to be 2.97. The number of infected travelers from the source country to the target country is then computed by multiplying the number of travelers into the target country by the proportion of infected people in the source country. This is often not a natural number. This is not a problem, as the renewal equation does not need to use integer number of infected people, and we interpret this as the results of the various averaging performed through all the steps. The model produces the total number of infected people in the target country given the seeds and the R t by day of infection. To validate the model, we need to estimate how many people infected with the VOC were present in the target country during the considered period. We do so in the same way we estimate prevalence in the source country: by multiplying the proportion of sequenced cases that turned out to be Alpha times the daily incidence in the target country, scaled by the estimated undercounting factor.
The total number of di↵erent scenarios computed is, in this case 2 ⇥ 2 ⇥ 3: undercounting in both the source and the target countries, and the di↵erent reproduction number of the VOC. Results are shown in Figure 3C and in Figure S13A.

Prediction error
For each lineage we evaluate di↵erent scenarios with a) low and high values of underreporting in both source and target country b) three di↵erent basic reproduction numbers R t that correspond to the range of growth rate values estimated from the phylogenetic reconstruction.
We infer from data the number of infected individuals with the emerging lineage in the target country m and we evaluate the prediction error as zero if this estimated number is included in the range identified by di↵erent epidemic scenarios. If the number of infected people evaluated from data is out of the range spanned by the epidemic curves, then the prediction error is evaluated as the root-mean-square error, normalized to the range of the data observed in the target country m, between observed and the closest simulated epidemic curve: where n t is the number of weeks with number of sequences greater than zero for the selected lineage in the considered country m, that is n t is the number of available data points with not null infected people. Since the scenario simulations stop at the 3 week after sequencing was reported in country m, n t is always n t = 2. The idea behind the normalization by the data range is that it reflects the noise of reported sequences, i.e. if the sequencing rate is low, we expect a large variation and the sequencing data is less reliable. Prediction errors evaluated for all the considered lineages are shown in Figure 3 of the main document. All the panels report the nRMSE in each country as a function of both the number of daily passengers normalized to the total country population (x-axis, values for 100000 individuals) and the number of total collected daily sequences normalized to the total number of confirmed cases (y-axis, values for 100000 cases). Insets show the evaluated error in each country. Results assess that, in most of the country, the simulated scenarios encompass the data and the prediction error is evaluated as zero. Moreover, error values greater than zero can be found for countries with higher passenger flows.

Pandemic delay
The pandemic delay estimates the time needed since tMRCA for a specific variant to reach a certain percentage y in a target country. It depends in general on a large variety of factors as the reproduction number, the fraction of vaccinated, the variant's immune escape, season, weather conditions, the number, duration and strength of active non-pharmaceutical interventions (NPI), the national and international mobility and the epidemic situation. In the following estimation of the pandemic delay, we assume that the main driver/predictors for the pandemic risk are the international mobility, the e↵ective reproduction number and the country specific epidemic situation. We will use a simple framework to combine the measures that is based on the replicator equation [27], stating that the fraction of a new variant can be described by a simple logistic growth equation (illustrated for Delta lineage in Figure S15). It assumes that there are 2 competing populations, the mixed population of all preexisting variants of size N pre and the population of the emergent variant N x . According to the replicator equation, the evolution of the fraction x of the new variant in the whole population corresponds to with f as the fitness of the new variant x andf as the mean fitness, i.e.
We can therefore rewrite the time-evolution to that has the logistic function as general solution with x 0 as initial condition being the imported infected cases from the country of origin n 0 to the target country m with t 0 = tMRCA, U r (t 0 , m) as the underreporting factor of cases in country m (introduced in Sec. III.2), as the probability of leaving the country via the WAN and p 1 (m|n 0 ) as the import risk (see Sec. II). Note that with Eqs. S31, S32 we assume that the initial import x 0 dominates, i.e. imports at later times can be neglected (otherwise a constant flux needs to be implemented). The fitness di↵erence between the new variant vs. the already existing variant mix is approximated by i.e. we assume that the reproduction number of the preexisting variant mix is one, motivated by the observed fluctuations around R pre = 1 due to the behavioral and/or medical adaptation to the local epidemic situation. The pandemic delay t y is the time needed for the new variant to reach the fraction y of the infected population, where t y (m) for a specific country m is (rearranging Eq. S31) (S34) We can further simplify the pandemic delay by assuming that the initial import is small However, this simplification is merely meant as a help to ease understanding of the functional relations. In the manuscript, we use explicitly Eq. S34. Fig. S15. The fraction of seq. on GISAID attributed to the Delta variant for four example countries. As described for the Alpha variant by Fort [27], the relative fraction of a new variant can be accurately described by a simple logistic growth equation (Eq. S31).

Information Distance
We also devise an alternative definition of distance on top of a network which embeds information from multiple-pathways di↵usion as an additional comparison to the import risk measure. Distances based on the di↵usive properties of the system have been of interest in recent years [10,13]. Another key example is the Di↵usion Distance [12] which estimates a metric distance between nodes based on how similarly the random walkers explore the network by using those nodes as sources, under the assumption that a mesoscale structure is recovered during the time scales in which the random walker explores its functional community. Starting from Di↵usion Distance definition, we propose an educated rewrite of the measure that fits the problem under study to predict arrival times of a random walker on the network, such as an infectious traveler from a source country. The probability p(t | i) of a walker to be in any point in the network at time t, starting from node i, embeds information of multiple paths via successive applications of the Laplacian operator. We introduce a new measure that merges this concept from Di↵usion Distance and also embeds information from E↵ective Distance [10], namely, the idea that low probabilities p k (t | i) are associated with large distances. This can be embedded by taking the negative of the logarithm of the probability, in analogy with Shannon's entropy. We now introduce this candidate measure for di↵usive dynamics which we define Information Distance: D ID (s!k) (t) = log 10 p k (t | s) ( S 3 6 ) in which p k (t | s) represents the k th entry associated with node k of the probability state p(t | s) = v s · e tL RW .
Here v s is the initial condition probability for the walker starting from node s, the canonical vector with s-th component equal to 1. The random walk normalized Laplacian (L RW ) [28] term encodes the probability to move from node i to node j in its matrix elements. Its o↵-diagonal terms can be computed as the negative value of P ij , which is directly estimated from the WAN weighted links as stated in subsection II.2. Given the multiple timescales involved in this definition, we evaluate the metric at di↵erent scales t to find the timescale at which D ID (t) performs better. ) and the weekly collected sequenced samples (based on GISAID-data [20]). We compute the final sequencing rate (Seq. Rate) by dividing through the underreporting factor (Underrep. Fact.) whose estimation is described in Sec. III.2. Both estimates are averaged for the lineage respective time-period between the median time of the most recent common ancestor (tMRCA) and the time when the first 50 samples got collected (t 50S ).